62 research outputs found

    DPI over commodity hardware: implementation of a scalable framework using FastFlow

    Get PDF
    In the last years we assisted to a large increase of the number of applications running on top of IP networks. Consequently the need to implement very efficient monitoring solutions that can manage these high data rates and that can classify the type of traffic which is traveling over the network has increased. For example, as far as network security is concerned, in the recent years we have seen a shift from so-called "network-level" attacks, which target the network they are transported on (e.g. Denial of Service), to content-based threats which exploit applications vulnerabilities and require sophisticated levels of intelligence to be detected. For some of these threats, it is no more sufficient to have only a software solution on the client side but we also need to run some controls on the network itself. To manage these kinds of scenarios, payload inspection is often required in order to correctly identify the application protocol and to process the data carried over it. This is the reason why, in recent years, Deep Packet Inspection (DPI) technology has emerged. This kind of processing is in many cases implemented, at least in part, through dedicated hardware. However, full software solutions may often be more appealing because they are typically more economical and have, in general, the capability to react faster to protocols evolution and changes. Moreover, software solutions which run over general purpose hardware do not exploit the underlying multiprocessor architecture, providing only the capability to process the incoming packets sequentially. Furthermore, many DPI research works that can be found in literature and which exploits multicore architectures are often characterized by a poor scalability, due to the overhead required for synchronization and to load unbalance among the used cores. In this thesis, we will describe the design and implementation of a DPI framework capable of managing current networks rates using commodity multicore hardware. Our framework provides the possibility to identify the protocol, to specify the kind of data to extract when it has been identified and how these data has to be processed. Differently from existing works, the developed framework has been designed according to the structured parallel programming theory, allowing thus to completely hide to the user the complexity of the management of the problems related to an efficient exploitation of the underlying architecture. These concepts have then been applied using FastFlow, a library for structured parallel programming targeting both shared memory and distributed memory architectures

    A power-aware, self-adaptive macro data flow framework

    Get PDF
    The dataflow programming model has been extensively used as an effective solution to implement efficient parallel programming frameworks. However, the amount of resources allocated to the runtime support is usually fixed once by the programmer or the runtime, and kept static during the entire execution. While there are cases where such a static choice may be appropriate, other scenarios may require to dynamically change the parallelism degree during the application execution. In this paper we propose an algorithm for multicore shared memory platforms, that dynamically selects the optimal number of cores to be used as well as their clock frequency according to either the workload pressure or to explicit user requirements. We implement the algorithm for both structured and unstructured parallel applications and we validate our proposal over three real applications, showing that it is able to save a significant amount of power, while not impairing the performance and not requiring additional effort from the application programmer

    Mammut: High-level management of system knobs and sensors

    Get PDF
    Managing low-level architectural features for controlling performance and power consumption is a growing demand in the parallel computing community. Such features include, but are not limited to: energy profiling, platform topology analysis, CPU cores disabling and frequency scaling. However, these low-level mechanisms are usually managed by specific tools, without any interaction between each other, thus hampering their usability. More important, most existing tools can only be used through a command line interface and they do not provide any API. Moreover, in most cases, they only allow monitoring and managing the same machine on which the tools are used. MAMMUT provides and integrates architectural management utilities through a high-level and easy-to-use object-oriented interface. By using MAMMUT, is possible to link together different collected information and to exploit them on both local and remote systems, to build architecture-aware applications

    A Reconfiguration Algorithm for Power-Aware Parallel Applications

    Get PDF
    In current computing systems, many applications require guarantees on their maximum power consumption to not exceed the available power budget. On the other hand, for some applications, it could be possible to decrease their performance, yet maintain an acceptable level, in order to reduce their power consumption. To provide such guarantees, a possible solution consists in changing the number of cores assigned to the application, their clock frequency, and the placement of application threads over the cores. However, power consumption and performance have different trends depending on the application considered and on its input. Finding a configuration of resources satisfying user requirements is, in the general case, a challenging task. In this article, we propose Nornir, an algorithm to automatically derive, without relying on historical data about previous executions, performance and power consumption models of an application in different configurations. By using these models, we are able to select a close-to-optimal configuration for the given user requirement, either performance or power consumption. The configuration of the application will be changed on-the-fly throughout the execution to adapt to workload fluctuations, external interferences, and/or application’s phase changes. We validate the algorithm by simulating it over the applications of the Parsec benchmark suit. Then, we implement our algorithm and we analyse its accuracy and overhead over some of these applications on a real execution environment. Eventually, we compare the quality of our proposal with that of the optimal algorithm and of some state-of-the-art solutions

    Flare: Flexible In-Network Allreduce

    Full text link
    The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, that aggregate the data received from the hosts, and send them back the aggregated result. However, existing solutions provide limited customization opportunities and might provide suboptimal performance when dealing with custom operators and data types, with sparse data, or when reproducibility of the aggregation is a concern. To deal with these problems, in this work we design a flexible programmable switch by using as a building block PsPIN, a RISC-V architecture implementing the sPIN programming model. We then design, model, and analyze different algorithms for executing the aggregation on this architecture, showing performance improvements compared to state-of-the-art approaches

    Analysing Multiple QoS Attributes in Parallel Design Patterns-Based Applications

    Get PDF
    Parallel design patterns can be fruitfully combined to develop parallel software applications. Different combinations of patterns can feature different QoS while being functionally equivalent. To support application developers in selecting the best combinations of patterns to develop their applications, we hereby propose a probabilistic approach that permits analysing, at design time, multiple QoS attributes of parallel design patterns-based application. We also present a proof-of-concept implementation of our approach, together with some experimental results

    Canary: Congestion-Aware In-Network Allreduce Using Dynamic Trees

    Full text link
    The allreduce operation is an essential building block for many distributed applications, ranging from the training of deep learning models to scientific computing. In an allreduce operation, data from multiple hosts is aggregated together and then broadcasted to each host participating in the operation. Allreduce performance can be improved by a factor of two by aggregating the data directly in the network. Switches aggregate data coming from multiple ports before forwarding the partially aggregated result to the next hop. In all existing solutions, each switch needs to know the ports from which it will receive the data to aggregate. However, this forces packets to traverse a predefined set of switches, making these solutions prone to congestion. For this reason, we design Canary, the first congestion-aware in-network allreduce algorithm. Canary uses load balancing algorithms to forward packets on the least congested paths. Because switches do not know from which ports they will receive the data to aggregate, they use timeouts to aggregate the data in a best-effort way. We develop a P4 Canary prototype and evaluate it on a Tofino switch. We then validate Canary through simulations on large networks, showing performance improvements up to 40% compared to the state-of-the-art
    • …
    corecore